Deep Learning Fundamentals - Model and Training

Deep Learning Model

Architecture: Neural Network Structure

Neural network = simple representation of a complex computation built from simple building blocks

Linear regression with basis functions that are themselves linear regressions with basis functions that are themselves ...

Model = architecture plus a set of parameters

More on neural networks

Layer

Layer = one level of linear regression

Several nodes (perceptrons)

Node = weighted sum & activation function

Activation Function

\mathbf{z_1} = \mathbf{W_1} \mathbf{x} + \mathbf{b_1}

\mathbf{z_2} = \mathbf{W_2} \mathbf{z_1} + \mathbf{b_2}

= \mathbf{W_2} (\mathbf{W_1} \mathbf{x} + \mathbf{b_1}) + \mathbf{b_2}

= \mathbf{W_2} \mathbf{W_1} \mathbf{x} + \mathbf{W_2} \mathbf{b_1} + \mathbf{b_2}

= \mathbf{W_{\text{total}}} \mathbf{x} + \mathbf{b_{\text{total}}}

\mathbf{W_{\text{total}}} = \mathbf{W_2} \mathbf{W_1}, \quad \mathbf{b_{\text{total}}} = \mathbf{W_2} \mathbf{b_1} + \mathbf{b_2}

Why activation?

Without activation: N layers = 1 layer

A linear combination
of linear combinations
is a linear combination

Non-linear activation
enables complex modeling

z_{1,j} = \sum_{i=1}^{n} c_{1,ij} \cdot x_i

z_2 = \sum_{j=1}^{m} c_{2,j} \cdot z_{1,j}

= \sum_{j=1}^{m} c_{2,j} \cdot \left( \sum_{i=1}^{n} c_{1,ij} \cdot x_i \right)

= \sum_{i=1}^{n} \left( \sum_{j=1}^{m} c_{2,j} \cdot c_{1,ij} \right) \cdot x_i

= \sum_{i=1}^{n} c_{\text{total}, i} \cdot x_i
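The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch in plain Python; the matrices, biases and the input are made-up integer values chosen only for illustration, and the helper names (`matvec`, `matmul`, `add`) are not from the slides.

```python
def matvec(W, x):
    """Matrix-vector product W @ x for nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Illustrative values: 3 inputs -> 2 hidden nodes -> 2 outputs, no activation
W1, b1 = [[1, 2, 0], [0, -1, 3]], [1, -2]
W2, b2 = [[2, 1], [-1, 4]], [0, 5]
x = [3, -1, 2]

# Layer by layer: z1 = W1 x + b1, then z2 = W2 z1 + b2
z1 = add(matvec(W1, x), b1)
z2 = add(matvec(W2, z1), b2)

# The same mapping as one single linear layer
W_total = matmul(W2, W1)
b_total = add(matvec(W2, b1), b2)
z2_single = add(matvec(W_total, x), b_total)

print(z2, z2_single)
```

Both results agree, so the second layer adds no modeling power until a non-linear activation is placed between the two layers.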

\text{Sigmoid}\\ \sigma(x) = \frac{1}{1 + e^{-x}}

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

\text{ReLU}(x) = \max(0, x)

More on activations

Choice of activation:

Output: value range of the target
Hidden: efficiency (?)

Activation Function

\text{Sigmoid}\\ \sigma(x) = \frac{1}{1 + e^{-x}} \\ \sigma'(x) = \sigma(x)(1-\sigma(x))

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \\ \tanh'(x) = 1 - \tanh^2(x)

\text{ReLU}(x) = \max(0, x) \\ \text{ReLU}'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}
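The derivative formulas above can be sanity-checked against a finite difference. This is a small sketch in plain Python; the function names and the test points are illustrative choices, and the value of the ReLU derivative at exactly x = 0 (left undefined on the slide) is set to 0 here by convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # Not defined at x = 0; returning 0 there is an assumption of this sketch
    return 1.0 if x > 0 else 0.0

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare the analytic derivatives with the numerical ones away from x = 0
for x in (-2.0, -0.5, 0.5, 2.0):
    assert abs(sigmoid_prime(x) - numerical_derivative(sigmoid, x)) < 1e-6
    assert abs(tanh_prime(x) - numerical_derivative(math.tanh, x)) < 1e-6
    assert abs(relu_prime(x) - numerical_derivative(relu, x)) < 1e-6
```

The ReLU derivative needs only a comparison, while sigmoid and tanh need exponentials, which is one reason ReLU is a cheap choice for hidden layers.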

Choice of activation:

Output: value range of the target
Hidden: efficiency (ReLU)

More on activations

Activation Function

\text{ReLU}(x) = \max(0, x) \\ \text{ReLU}'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}

Dying ReLU:

Input permanently x < 0
ReLU & its derivative = 0
Perceptron does not learn
Becomes unusable

PReLU, RReLU, ELU, SELU, GELU, Swish, Mish, ReLU6, ...

\text{Leaky ReLU}(x) = \max(\alpha x, x) \\ \text{Leaky ReLU}'(x) = \begin{cases} \alpha & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}

\text{ALReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ |\alpha \cdot x| & \text{if } x < 0 \end{cases} \\ \text{ALReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ -\alpha & \text{if } x < 0 \end{cases}

Useful when a unit is activated only very rarely
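The dying-ReLU effect and the Leaky ReLU remedy can be illustrated with a single neuron. The sketch below assumes a squared-error loss and made-up data; the bias is deliberately set so negative that the pre-activation is below zero for every input. All names and values are illustrative.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def leaky_relu_prime(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def weight_gradient(act, act_prime, w, b, xs, ys):
    """d/dw of sum_i (y_i - act(w * x_i + b))^2 for one neuron."""
    grad = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        grad += -2.0 * (y - act(z)) * act_prime(z) * x
    return grad

# Illustrative data; with b = -10 every pre-activation z is negative
xs, ys = [0.5, 1.0, 2.0], [1.0, 2.0, 4.0]
w, b = 1.0, -10.0

grad_relu = weight_gradient(relu, relu_prime, w, b, xs, ys)
grad_leaky = weight_gradient(leaky_relu, leaky_relu_prime, w, b, xs, ys)
print(grad_relu, grad_leaky)
```

With ReLU the gradient is exactly zero, so no update can ever move the neuron out of this state. With Leaky ReLU a small gradient survives and the weight can still recover.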

Activation Function

Implementation

	PyTorch	TensorFlow
Properties	Pythonic, simple syntax; faster training; dynamic computation graph; higher flexibility	Scalable; memory-efficient; static or dynamic computation graph
Main use	Research, prototyping	Large projects, production
Community	Research	Industry
Packages	TorchText, TorchVision, TorchAudio	TF Extended, TF Lite, TF Serving

Both frameworks are very useful & widely used

The mathematics is identical & the structure is very similar

The choice is usually determined by the working environment

Designing the Architecture

Define the task clearly (classification, regression, detection, ...)
Fix the input and output dimensions (MNIST: in: 784; out: 10)
Choose a suitable type of layer (linear, convolutional, ...)
Fix the number of layers and the number of neurons per layer
Choose the activation functions (hidden & output)

Where possible, reuse existing architectures / models

Implementation: Architecture

MNIST Classifier

from torch import nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        
        
        
        

    def forward(self, x):
        
        
        
        
        return x

model = Classifier()
output = model(data)

PyTorch (Meta)

MNIST Classifier

Implementation: Architecture

from torch import nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        
        
        
        self.fc4 = nn.Linear(784, 10)

    def forward(self, x):
        
        
        
        x = self.fc4(x)
        return x

model = Classifier()
output = model(data)

PyTorch (Meta)

MNIST Classifier
10 Outputs

Implementation: Architecture

from torch import nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        return x

model = Classifier()
output = model(data)

PyTorch (Meta)

MNIST Classifier
10 Outputs
3 hidden layers

Implementation: Architecture

from torch import nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

model = Classifier()
output = model(data)

PyTorch (Meta)

MNIST Classifier
10 Outputs
3 hidden layers

Implementation: Architecture

from torch import nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
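        # Softmax turns the 10 outputs into class probabilities.
        # Note: nn.CrossEntropyLoss expects raw logits, so when training
        # with that loss, return self.fc4(x) without the softmax.
        x = F.softmax(self.fc4(x), dim=1)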
        return x

model = Classifier()
output = model(data)

PyTorch (Meta)

MNIST Classifier
10 Outputs
3 hidden layers
Softmax activation
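What the softmax output activation does to the 10 raw outputs can be sketched in plain Python. The logits below are hypothetical stand-ins for the output of the last layer, not values from a trained model, and subtracting the maximum is a common stabilization trick rather than something shown on the slides.

```python
import math

def softmax(logits):
    """Softmax over a list of raw scores."""
    m = max(logits)                          # subtracting the max avoids overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for one image, one score per digit 0..9
logits = [2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -2.0, 1.5, 0.2]

probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, sum(probs))
```

The outputs are positive and sum to 1, which matches the value range of a class-probability target. The predicted class is the index with the largest probability.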

import tensorflow as tf
from tensorflow.keras import layers

class Classifier(tf.keras.Model):
    def __init__(self):
        super(Classifier, self).__init__()
        self.fc1 = layers.Dense(64, activation='relu')
        self.fc2 = layers.Dense(64, activation='relu')
        self.fc3 = layers.Dense(64, activation='relu')
        self.fc4 = layers.Dense(10, activation='softmax')

    def call(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        return x

model = Classifier()
model.build((None, 784))
model(data)

TensorFlow (Google)

MNIST Classifier
10 Outputs
3 hidden layers
Softmax activation

Implementation: Architecture

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
    nn.Softmax(dim=1)
)

output = model(data)

PyTorch (Meta)

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(784,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model(data)

TensorFlow (Google)

Implementation: Architecture

MNIST Classifier
10 Outputs
3 hidden layers
Softmax activation
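To get a feeling for the size of this architecture, the number of trainable parameters can be counted by hand. The helper below is a small sketch, not part of either framework; it assumes fully connected layers with one bias per output neuron, as in the classifier above.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Layer widths of the MNIST classifier: input, 3 hidden layers, output
layer_sizes = [784, 64, 64, 64, 10]

print(count_parameters(layer_sizes))  # 59210
```

Most of the parameters (50240 of them) sit in the first layer, because the input dimension is far larger than the hidden width.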

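Training Loop

Load data (batch)
Apply the model (forward)
Compute the loss
Compute the updates (backward)
Apply the update

from torch.utils.data import DataLoader

trainloader = DataLoader(trainset, batch_size=256, shuffle=True)

for images, labels in trainloader:        # load data (batch)

    prediction = model(images)            # apply the model (forward)

    loss = criterion(prediction, labels)  # compute the loss

    optimizer.zero_grad()                 # clear gradients of the previous step
    loss.backward()                       # compute the updates (backward)

    optimizer.step()                      # apply the update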

Designing the Training Loop

Define the task clearly
Choose the loss function
Define the computation steps

Loss Function

Defines the goal of the training
Goal: minimize the loss
Allows models to be compared
Different losses for different tasks

Loss Function

Mean Squared Error (MSE):
mean squared deviation
Binary Cross-Entropy (BCE):
compares the predicted probability of a single class with the target
Cross-Entropy (CE):
compares the predicted probabilities of several classes with the target

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

BCE = -\sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]

CE = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})
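The three formulas can be written out directly in plain Python. This is a sketch that follows the formulas as stated above, so BCE and CE are sums over the samples; framework implementations such as PyTorch's `nn.CrossEntropyLoss` average over the batch by default. The function names and example values are illustrative.

```python
import math

def mse(y, y_hat):
    """Mean squared error over N samples."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)

def bce(y, y_hat):
    """Binary cross-entropy; y in {0, 1}, y_hat strictly between 0 and 1."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y, y_hat))

def ce(y, y_hat):
    """Cross-entropy; y holds one-hot rows, y_hat rows of class probabilities."""
    return -sum(t * math.log(p)
                for row_t, row_p in zip(y, y_hat)
                for t, p in zip(row_t, row_p)
                if t > 0)  # terms with y_ic = 0 contribute nothing

# Illustrative values
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(bce([1, 0], [0.9, 0.2]))
print(ce([[1, 0, 0], [0, 1, 0]], [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]))
```

For two classes, CE with one-hot targets reduces to BCE: only the probability assigned to the correct class enters the sum in both cases.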

More on background and variants

Implementation: Training

Loss: CrossEntropy

from torch import nn, optim

criterion = nn.CrossEntropyLoss()

PyTorch

Loss: CrossEntropy

Implementation: Training

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

PyTorch

Loss: CrossEntropy
Optimizer: Adam

Implementation: Training

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

for e in range(epochs):

PyTorch

Loss: CrossEntropy
Optimizer: Adam

Implementation: Training

Epoch: one training pass over all data

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

for e in range(epochs):
    
    for images, labels in trainloader:

PyTorch

Loss: CrossEntropy
Optimizer: Adam

Implementation: Training

Input & target, batch by batch

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

for e in range(epochs):
    
    for images, labels in trainloader:
        prediction = model(images)
        loss = criterion(prediction, labels)

PyTorch

Loss: CrossEntropy
Optimizer: Adam

Forward-Pass

Implementation: Training

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

for e in range(epochs):
    
    for images, labels in trainloader:
        prediction = model(images)
        loss = criterion(prediction, labels)

        optimizer.zero_grad()
        loss.backward()

PyTorch

Loss: CrossEntropy
Optimizer: Adam

Backward-Pass

Implementation: Training

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

for e in range(epochs):
    
    for images, labels in trainloader:
        prediction = model(images)
        loss = criterion(prediction, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

PyTorch

Loss: CrossEntropy
Optimizer: Adam

Update

Implementation: Training

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        prediction = model(images)
        loss = criterion(prediction, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

PyTorch

Loss: CrossEntropy
Optimizer: Adam

from tensorflow.keras import optimizers

model.compile(optimizer=optimizers.Adam(learning_rate=0.003),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=1, batch_size=64)

print(f'Training loss: {history.history["loss"][0]}')

TensorFlow

Implementation: Training

Update

Derivative of the cost -> direction of improvement -> update

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Gradient Descent

\text{Learning rate } \alpha
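The update rule can be tried on a toy problem. The sketch below minimizes the made-up loss (θ - 3)² with plain gradient descent; the loss, the starting value, the learning rate and the number of steps are illustrative assumptions.

```python
def gradient_descent(grad, theta, lr, steps):
    """Repeat the update theta <- theta - lr * dLoss/dtheta."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def grad(theta):
    # Derivative of the illustrative loss (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta_good = gradient_descent(grad, theta=0.0, lr=0.1, steps=100)
theta_bad = gradient_descent(grad, theta=0.0, lr=1.1, steps=20)
print(theta_good, theta_bad)
```

With a learning rate of 0.1 the parameter approaches the minimum at 3. With 1.1 every step overshoots by more than it corrects and the parameter diverges, which is why the learning rate has to be chosen with care.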

Optimizer

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Loss = \sum\limits_i (y_i - \hat{y}_i)^2 \\ \text{Forward: }\hat{y} = f(x)

x \text{: input data} \\ y \text{: target data} \\ \hat{y} \text{: prediction} \\ f \text{: model} \\ \theta \text{: parameters}

Gradient Descent

Optimizer

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Loss = \sum\limits_i (y_i - \hat{y}_i)^2 \\ \text{Forward: }\hat{y} = f(x) = a(z) = a(b + w \cdot x)

b \text{: bias} \\ w \text{: weights} \\ a \text{: activation}

Gradient Descent

Optimizer

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Loss = \sum\limits_i (y_i - \hat{y}_i)^2 \\ \text{Forward: }\hat{y} = f(x) = a(z) = a(b + w \cdot x) \\~\\ \text{Backward: }\frac{\partial Loss}{\partial b} = \frac{\partial Loss}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} \\~\\ = -\sum\limits_i 2(y_i - \hat{y}_i) \cdot \frac{\partial a}{\partial z} \quad \text{since } \frac{\partial z}{\partial b} = 1

Gradient Descent

Optimizer

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Loss = \sum\limits_i (y_i - \hat{y}_i)^2 \\ \text{Forward: }{\color{red} \hat{y}_i = f({\color{orange}x_i}) = {\color{red}a_1(z_1({\color{orange}x_i}))} = {\color{red} a_1(b_1 + w_1 \cdot {\color{orange}x_i})}} \\~\\ \text{Backward: }\frac{\partial Loss}{\partial b_1} = \frac{\partial Loss}{\partial a_1} \cdot {\color{red} \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} } \\~\\ = -\sum\limits_i 2(y_i - \hat{y}_i) \cdot {\color{red} \frac{\partial a_1}{\partial z_1} }

Gradient Descent

Optimizer

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Loss = \sum\limits_i (y_i - \hat{y}_i)^2 \\ \text{Forward: }{\color{cyan} \hat{y}_i = f({\color{orange}x_i}) = {\color{cyan}a_2(z_2({\color{red}a_1(z_1({\color{orange}x_i}))}))} = {\color{cyan} a_2(b_2 + w_2 \cdot {\color{red} a_1(b_1 + w_1 \cdot {\color{orange}x_i})})}} \\~\\ \text{Backward: }\frac{\partial Loss}{\partial b_1} = \frac{\partial Loss}{\partial a_2} \cdot {\color{cyan} \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot } {\color{red} \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} } \\~\\ = -\sum\limits_i 2(y_i - \hat{y}_i) \cdot{\color{cyan} \frac{\partial a_2}{\partial z_2} \cdot w_2 \cdot } {\color{red} \frac{\partial a_1}{\partial z_1} }

Gradient Descent

Optimizer

\theta \leftarrow \theta - \alpha \cdot \frac{\partial \text{Loss}}{\partial \theta}

Loss = \sum\limits_i (y_i - \hat{y}_i)^2 \\ \text{Forward: }{\color{lightgreen} \hat{y}_i = f({\color{orange}x_i}) = a_3(z_3({\color{cyan}a_2(z_2({\color{red}a_1(z_1({\color{orange}x_i}))})) })) = a_3(b_3 + w_3 \cdot {\color{cyan} a_2(b_2 + w_2 \cdot {\color{red} a_1(b_1 + w_1 \cdot {\color{orange}x_i})})})} \\~\\ \text{Backward: }\frac{\partial Loss}{\partial b_1} = \frac{\partial Loss}{\partial a_3} \cdot {\color{lightgreen} \frac{\partial a_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial a_2} \cdot } {\color{cyan} \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot } {\color{red} \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} } \\~\\ = -\sum\limits_i 2(y_i - \hat{y}_i) \cdot {\color{lightgreen} \frac{\partial a_3}{\partial z_3} \cdot w_3 \cdot } {\color{cyan} \frac{\partial a_2}{\partial z_2} \cdot w_2 \cdot } {\color{red} \frac{\partial a_1}{\partial z_1} }
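The three-layer chain rule can be checked against a finite difference. The sketch below uses one scalar weight and bias per layer, as in the formula, and assumes a sigmoid for every activation a_k (the slides leave the activation generic). The parameter values and the data are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the activations a_1, a_2, a_3 of the three scalar layers."""
    acts, a = [], x
    for w, b in params:
        a = sigmoid(b + w * a)   # a_k = a(z_k) with z_k = b_k + w_k * a_(k-1)
        acts.append(a)
    return acts

def loss(params, xs, ys):
    """Sum of squared errors, the last activation is the prediction."""
    return sum((y - forward(params, x)[-1]) ** 2 for x, y in zip(xs, ys))

def grad_b1(params, xs, ys):
    """dLoss/db_1 via the chain rule, summed over all samples."""
    (w1, b1), (w2, b2), (w3, b3) = params
    total = 0.0
    for x, y in zip(xs, ys):
        acts = forward(params, x)
        d1, d2, d3 = (a * (1.0 - a) for a in acts)   # da_k/dz_k for sigmoid
        total += -2.0 * (y - acts[-1]) * d3 * w3 * d2 * w2 * d1
    return total

params = [(0.5, 0.1), (-1.2, 0.3), (0.8, -0.4)]   # illustrative (w_k, b_k)
xs, ys = [0.0, 1.0, 2.0], [0.2, 0.5, 0.9]

# Central finite difference with respect to b_1 as an independent check
h = 1e-6
plus = [(0.5, 0.1 + h)] + params[1:]
minus = [(0.5, 0.1 - h)] + params[1:]
numeric = (loss(plus, xs, ys) - loss(minus, xs, ys)) / (2 * h)
analytic = grad_b1(params, xs, ys)
print(analytic, numeric)
```

The two values agree up to numerical error. Each additional layer contributes one more pair of factors (activation derivative and weight), which is what a framework's `backward()` call computes automatically for every parameter.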